Abstract: We offer a theoretical validation of the curse
of dimensionality in the pivot-based indexing of datasets for
similarity search, by proving, in the framework of statistical
learning, that in high dimensions no pivot-based indexing
scheme can essentially outperform the linear scan.

A study of the asymptotic performance of pivot-based indexing
schemes is performed on a sequence of datasets modeled
as samples picked in i.i.d. fashion from a sequence of metric
spaces. We allow the size of the dataset to grow in relation
to dimension, such that the dimension is superlogarithmic but
subpolynomial in the size of the dataset. The number of pivots is
sublinear in the size of the dataset. We pick the least restrictive
cost model of similarity search where we count each distance
calculation as a single computation and disregard the rest.

We demonstrate that if the intrinsic dimension of the spaces
in the sense of the concentration of measure phenomenon is linear
in dimension, then the performance of pivot-based similarity search
indexes is asymptotically linear in the size of the dataset.

Keywords: Data structures; Similarity search; Curse of dimensionality;
Concentration of measure



I. Introduction 

The problem of similarity search in databases is addressed
by building indexing schemes of various types [Cia97],
[Cha01], [Cha05], [Zez05]. The goal of such structures
is that a search algorithm can exploit them to perform
similarity search in time sublinear in the database size. That
indexing schemes do not scale well with increasing dimension
has been referred to as "the curse of dimensionality"
[Bey99], [Ind04].

We feel that in order to gain a better insight into the
nature of the curse of dimensionality, it is necessary to
have a precise mathematical understanding of the geometric
and algorithmic aspects of what happens in genuinely
high-dimensional datasets. With this purpose, we have chosen
to analyse one of the most popular indexing schemes for
similarity search, the one based on pivots [Bus03], [Cha01].
The mathematical setting for our analysis is a rigorous model
of statistical learning theory [Blu89], [Dev97], [Vap98],
[Vid03], where datasets are drawn randomly from domains
of increasing dimension.

This probabilistic setting is similar to that used in a
previous asymptotic analysis of similarity search [Sha06].
We also adopt a cost model where we count distance
computations only, in line with [Sha06]. Unlike this previous
work, we let both the dimension $d$ and the size of the dataset
$n$ grow as described in [Ind04]. We also make the distinction
between the dataset and the data space mathematically
explicit. In particular we emphasize that statements of the
type "all indexing schemes will degenerate to linear scan with
increasing dimension" (to paraphrase [Web98]) will always
need to be qualified with estimates of the probability. For it
is not impossible to sample a hypercube uniformly and come
up with a "distribution with a million clusters" [Sha06].

Our analysis is done on a sequence of (data) spaces that
exhibit the concentration of measure phenomenon [Gro83],
[Mil86] (Sect. V), a concept linked to what is called in
[Sha06] workloads with vanishing variance. It is also in
terms of this concentration of measure that we define the
dimension $d$. To show that the above situation with a million
clusters cannot happen (too often) we study the convergence
of empirical probabilities to their true values using a result
from Statistical Learning Theory [Vap98]. We introduce a
property of a sequence of spaces which is sufficient to invoke
this result.

The conclusion of our analysis (Sect. VIII) is that for
high dimensional datasets the class of pivot-based indexing
schemes cannot significantly outperform the baseline linear
scan of checking every element of the database.

II. Metrics, measures, and datasets 

We model the dataset as a sample of a metric space with
measure. A metric (or: distance) on a set $\Omega$ will be denoted
by $\rho$, and we will not recall the definition. The (open) ball
of radius $r$ and centre $q$ in a metric space $(\Omega, \rho)$ is denoted

$$B_r(q) := \{\omega \in \Omega \mid \rho(q,\omega) < r\}.$$

The family $\mathcal{B} = \mathcal{B}_\Omega$ of Borel subsets of a metric space
$(\Omega, \rho)$ is the smallest family containing all the open balls and
the entire set $\Omega$, and closed under complements and countable
unions.

A (Borel) probability measure on the space is a
function $\mu : \mathcal{B}_\Omega \to [0, 1]$ such that $\mu(\Omega) = 1$, and which is
countably additive: for a sequence $B_1, B_2, \ldots$ of pairwise
disjoint sets from $\mathcal{B}$, $\mu(\bigcup_i B_i) = \sum_i \mu(B_i)$.

A dataset, however large, is always a finite subset $X \subset \Omega$.
It naturally inherits the metric $\rho|_X$ and in place of $\mu$
supports the normalized counting (also: empirical) probability
measure:

$$\mu_\#(A) = \frac{|A \cap X|}{|X|}.$$

We will treat $X$ as a sample of $\Omega$ with regard to the measure
$\mu$, that is, a sequence of i.i.d. random variables $(X_i) \sim \mu$.

Given a domain $\Omega$ together with a dataset $X \subset \Omega$, we can
perform several kinds of similarity queries, with our focus
on two main ones. A $k$ nearest neighbour query consists
of, given a query centre (key) $q \in \Omega$, finding the $k$ closest
elements in $X$ to $q$. To answer a range query is to find all
the elements in $X$ within distance $r$ from $q$.

Distinguishing formally between $X$ and $\Omega$ is necessary
precisely because a typical search will begin with a centre
$q \in \Omega$, and $q \in X$ is rare.

To answer a similarity query we can always fall back on the strategy
of looking up every element $x \in X$ and calculating
$\rho(q, x)$. Following [Cha01], we adopt the number of distance
calculations $\rho(q, x)$ as the unit of time complexity. In that
framework we will call the above strategy a linear scan.
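
The linear scan and its cost can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names `linear_scan_range` and `linear_scan_knn` are hypothetical, the metric is passed in as a callable `rho`, and the range query uses the open-ball convention of this section.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def linear_scan_range(
    dataset: Sequence[T],
    q: T,
    r: float,
    rho: Callable[[T, T], float],
) -> Tuple[List[T], int]:
    """Return every x in dataset with rho(q, x) < r, together with the
    number of distance computations (the cost in this model)."""
    hits: List[T] = []
    cost = 0
    for x in dataset:
        cost += 1  # one evaluation of rho(q, x)
        if rho(q, x) < r:
            hits.append(x)
    return hits, cost


def linear_scan_knn(
    dataset: Sequence[T],
    q: T,
    k: int,
    rho: Callable[[T, T], float],
) -> Tuple[List[T], int]:
    """Return the k elements closest to q; rho is evaluated once per element."""
    ranked = sorted(dataset, key=lambda x: rho(q, x))
    return ranked[:k], len(dataset)
```

In both cases the cost equals the size of the dataset, which is the baseline any indexing scheme has to beat.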

An indexing scheme is a structure whose aim is to speed 
up the execution of similarity queries on a particular dataset, 
typically consisting of some pre-calculated values and an 
algorithm. 

III. The curse of dimensionality 

An often repeated observation is the inability of many
existing algorithms to deal with high dimensional datasets
(e.g. [Bey99]), a phenomenon described as the curse of
dimensionality, when performance drops exponentially as a
function of dimension.

The concept of dimension in a general metric space with
measure is less precise. Clearly it has to obey our intuition in
Euclidean space, so for example a plane in the 10-dimensional
space $\mathbb{R}^{10}$ is still 2-dimensional, and it would be desirable
for a uniformly distributed ball in $\mathbb{R}^d$ to be $d$-dimensional.

A version of intrinsic dimension was proposed by the
authors of [Cha01b] as

$$\frac{E(\rho(x,y))^2}{2\,\mathrm{Var}(\rho(x,y))},$$

where $x, y \sim \mu$, the distribution of points in $\Omega$. It is based
on the observation that if the histogram of distances from
$q$ to points in $X$ shows a lot of "concentration", this will
be a hard query to process as it is harder to rule out points
using a triangle inequality type approach. That the above
dimension is asymptotically equal to the usual notion in
Euclidean spaces is mentioned in [Cha01b], where a result
on the time complexity of search in terms of $d$ is also stated. It
is a lower bound on the order of $d \ln(n)$.
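
As a rough numerical illustration of this notion, the following sketch estimates the ratio of the squared mean distance to twice the distance variance from randomly chosen pairs. The helper names (`intrinsic_dimension`, `random_cube_sample`) and the choice of the Hamming cube with normalized Hamming distance as test space are our own assumptions, not part of [Cha01b].

```python
import random


def normalized_hamming(x, y):
    """Fraction of coordinates in which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def random_cube_sample(d, n, seed=0):
    """n points drawn uniformly from the d-dimensional binary cube."""
    rng = random.Random(seed)
    return [tuple(rng.randint(0, 1) for _ in range(d)) for _ in range(n)]


def intrinsic_dimension(points, rho, pairs=2000, seed=0):
    """Estimate mean(rho)^2 / (2 * var(rho)) from random pairs of points."""
    rng = random.Random(seed)
    dists = []
    for _ in range(pairs):
        x, y = rng.sample(points, 2)
        dists.append(rho(x, y))
    mean = sum(dists) / len(dists)
    var = sum((t - mean) ** 2 for t in dists) / len(dists)
    return mean * mean / (2.0 * var)
```

For uniformly random binary strings the normalized distance has mean 1/2 and variance 1/(4d), so the estimate should come out near d/2: it grows linearly with the nominal dimension.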

In this article, we will use another approach to the intrinsic
dimension, elaborated in [Pes08] and also based on the
phenomenon of concentration of measure, cf. Sect. V.

In general the time complexity we are looking for in
search depends both on dimension (henceforth we will simply
call it $d$) and size of dataset $n$. An asymptotic analysis of
the performance of indexing schemes will therefore involve
both $d \to \infty$ and $n \to \infty$. Search in sublinear time in $n$ is
an obvious goal:

$$\text{querytime} = o(n),$$

where by querytime we mean the average time it takes for
a similarity query to execute, time measured in distance
computations.

Storage is also important, with at most polynomial storage
allowed in theoretical analysis (though in practice even
$n^2$ may be too much):

$$\text{storage} = n^{O(1)}.$$

For the pivot-based indexing scheme the storage will be
measured by the number of distances stored.

We will follow an approach in the authoritative survey
[Ind04] and focus on a particular range for the rate of growth
of the dimension $d$, superlogarithmic but subpolynomial in $n$:

$$d = \omega(\log n), \tag{1}$$

$$d = n^{o(1)}. \tag{2}$$



This choice of bounds is due to a case study of the Hamming
cubes. Recall that the Hamming cube $\Sigma^d$ of dimension $d$ is
the set of all binary sequences of length $d$, and the distance
between two strings is just the number of positions in which
they differ, divided by $d$:

$$\rho(x,y) = \frac{|\{i : x_i \neq y_i\}|}{d}$$

(the normalized Hamming distance).

In the case where $d$ grows slowly, $d = O(\log n)$, all
possible queries can be pre-computed and stored without
breaking the polynomial storage requirement. Hence the
lower bound. The upper bound results from the observation
that if $d$ grew so fast that $n = d^{O(1)}$, a sequential scan would
be polynomial in $d$ and so acceptable.
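
The lower end of this range can be made concrete with a short sketch. It computes the normalized Hamming distance on binary strings and counts the possible query centres when the dimension is logarithmic in the dataset size. The function names and the parameter `c` (the constant in d = c log2 n) are our own illustrative choices.

```python
import math


def normalized_hamming(x: str, y: str) -> float:
    """Number of positions where x and y differ, divided by the length d."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def lookup_table_entries(n: int, c: float = 1.0) -> int:
    """Number of distinct query centres in the Hamming cube when
    d = floor(c * log2(n)): there are 2**d of them, about n**c."""
    d = int(c * math.log2(n))
    return 2 ** d
```

With n = 1024 and c = 2 the cube has 2^20 = n^2 points, so a table holding the answer for every possible query centre still fits in polynomial storage.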

Summarizing: the goal of finding a scalable index is to
find an algorithm that uses storage polynomial in $n$ (preferably
of degree less than 2) and allows search in time polynomial in $d$.

This stands in contrast to the curse of dimensionality
conjecture, as stated in [Ind04]:

If $d = \omega(\log n)$ and $d = n^{o(1)}$, any sequence of indexes
built on a sequence of datasets $X_d \subset \Sigma^d$ allowing exact
nearest neighbour search in time polynomial in $d$ must use
$n^{\omega(1)}$ space.

The conjecture remains unproven in the case of general
indexing schemes. The goal of this article is to show that
at least for pivot-based indexes the above conjecture holds
even in a strengthened form.



IV. Pivot-based indexing

We will focus on one class of indexing schemes, the
pivot-based index (e.g. AESA, MVPT, BKT, ...; see [Cha01] and
[Zez05]). The index is built using a set of pivots $\{p_1, \ldots, p_k\}$
from $\Omega$, and consists of the array of $n \times k$ distances

$$\rho(x, p_i), \quad 1 \le i \le k,\ x \in X.$$

Given a range query with radius $r$ and centre $q$, the $k$
distances $\rho(q,p_1), \ldots, \rho(q,p_k)$ are computed so that $\rho(q,x)$
can be lower-bounded by the triangle inequality:

$$\rho(q,x) \ge \sup_{1 \le i \le k} |\rho(q,p_i) - \rho(x,p_i)|.$$

It is useful to think of a new distance function,

$$\rho_k(q,x) := \sup_{1 \le i \le k} |\rho(q,p_i) - \rho(x,p_i)|.$$

The fact that $\rho(q,x) \ge \rho_k(q,x)$ can be used to discard
all $x$ satisfying $\rho_k(q,x) > r$. For the remaining points, the
algorithm will verify if $\rho(q,x) < r$. If it is true, the point is
returned.

We will only analyze range queries; $k$-nearest neighbour
queries can always be simulated by a range query of suitable
radius [Zez05].

For a query centre $q$ denote by $C_q$ all the points of $X$
satisfying $\rho_k(q,x) > r$, i.e. all the elements to be discarded.
Making $C_q$ large is the primary way of cutting the cost of
search in our cost model. Of course we can achieve this
trivially with a very large number of pivots. This will defeat
the purpose, however, as

$$\text{Cost of range search} = k + |X \setminus C_q|.$$

The most often used solution is to keep adding pivots as
long as it is found experimentally to decrease the cost of
search. If $k$ is small, on the order of $\log n$ (as often space
limitations require), the most important component of cost
becomes the size of $X \setminus C_q$, and this is where the choice of
pivots would seem to matter. Various approaches to pivot
selection have been investigated in [Bus03]. The empirical
results seem to suggest that a moderate reduction in the number
of distance computations can be achieved, although the relative
improvement drops with increasing dimension.

Remark IV.1 (The number of pivots $k$). There are indexing
schemes, like AESA [Zez05], where $k = n$. However, in
many situations $n^2$ storage is not practical, and it has even
been argued that under certain assumptions the optimal
number of pivots is on the order of $\ln n$ [Cha01]. It is also
true that the query algorithm we analyze has complexity at
least $k$, so only schemes with $k = o(n)$ can claim to beat
the curse of dimensionality.

V. Concentration of measure

Perhaps the most compelling way to describe the concentration
of measure phenomenon is to draw a picture. We will
attempt to draw the (surface of the) unit sphere for various
$d$, by sampling points and projecting them onto a flat surface.
Any orthogonal projection, say taking the first 2 coordinates,
will give a picture similar to that in Figure 1. Under the
sampling approach, it appears that high dimensional spheres
are "small" even if we know their diameter to be a constant
irrespective of $d$.

Figure 1. Projection of randomly sampled spheres of various dimensions
d = 10, 20, 50, 100.
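
The sampling experiment behind such a picture can be reproduced with a short sketch. We assume here, as our own choice, that uniform points on the unit sphere of R^d are obtained by normalizing standard Gaussian vectors, and we summarize the picture by the median distance of the projected points from the origin instead of plotting them. The function names are hypothetical.

```python
import math
import random
import statistics


def sample_sphere(d, rng):
    """One point drawn uniformly from the unit sphere in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def median_projected_radius(d, samples=500, seed=0):
    """Median distance from the origin after projecting sampled points
    onto their first two coordinates."""
    rng = random.Random(seed)
    radii = []
    for _ in range(samples):
        x = sample_sphere(d, rng)
        radii.append(math.hypot(x[0], x[1]))
    return statistics.median(radii)
```

Calling `median_projected_radius` for d = 10, 20, 50, 100 gives a decreasing sequence: the projected cloud shrinks towards the origin although every sampled point has norm 1.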



This phenomenon is observed in a much greater variety of
situations and formalized as follows. Given a metric space
$(\Omega, \rho)$, define the $\epsilon$-neighborhood of $A \subset \Omega$ as

$$A_\epsilon = \{\omega \in \Omega \mid \rho(\omega, a) < \epsilon \text{ for some } a \in A\}.$$

Definition V.1. The concentration function $\alpha = \alpha_\Omega$ of a
metric space with measure $(\Omega, \rho, \mu)$ is defined as

$$\alpha(0) = 1/2, \qquad \alpha(\epsilon) = \sup\{1 - \mu(A_\epsilon) \mid A \subset \Omega,\ \mu(A) \ge 1/2\}, \quad \epsilon > 0.$$

To put it less formally, we are trying to measure how
much of the space remains after "fat" is added to a somewhat
large set in the form of an $\epsilon$-neighborhood. When very little
remains, we say that the concentration of measure takes
place.

Example V.1. The spheres $S^d$ in $\mathbb{R}^{d+1}$, taken with the
geodesic or Euclidean distance and the normalized invariant
measure, produce a concentration function with a Gaussian
upper bound of the form $C e^{-c\epsilon^2 d}$ [Mil86].

In this case an exact expression for the concentration
function is known [Mil86], based on the fact that the
half-sphere, among all subsets of measure at least 1/2, will
always produce the smallest $\epsilon$-neighborhood, no matter the
$\epsilon$. A plot of the resulting concentration functions, for several
values of $d$, appears in Figure 2.

Figure 2. The concentration functions of various spheres

Definition V.2. A sequence of spaces $(\Omega_d)_{d=1}^{\infty}$ is a normal
Lévy family [Mil86] if $C, c > 0$ exist such that

$$\alpha_d(\epsilon) \le C e^{-c\epsilon^2 d}.$$

Example V.2. The balls $B^d$, taken with the Euclidean distance
and the uniform probability measure ($d$-dimensional
Lebesgue), form a normal Lévy family.

Example V.3. The Hamming cubes $\Sigma^d$ form a normal
Lévy family under the normalized Hamming metric and the
uniform measure.

The concentration of measure can be equivalently described
in terms of Lipschitz functions. Recall that a
function $f : \Omega \to \mathbb{R}$ is 1-Lipschitz if

$$\forall x, y \in \Omega, \quad |f(x) - f(y)| \le \rho(x, y).$$

Recall further that a median of a function $f :
(\Omega, \rho, \mu) \to \mathbb{R}$ is any number $M$ satisfying:

$$\mu\{\omega \mid f(\omega) \le M\} \ge 1/2 \quad\text{and}\quad \mu\{\omega \mid f(\omega) \ge M\} \ge 1/2.$$

It is then relatively straightforward to prove:

Theorem V.3 (Cf. [Mil86]). For a 1-Lipschitz function $f$
defined on a space $(\Omega, \rho, \mu)$:

$$\forall \epsilon > 0, \quad \mu\{\omega \mid |f(\omega) - M| > \epsilon\} \le 2\alpha(\epsilon).$$

The relevance of concentration of measure in indexing is
noted in [Pes00]. Observe that

$$\rho(\cdot, p) : \Omega \to \mathbb{R}, \quad \omega \mapsto \rho(\omega, p)$$

is 1-Lipschitz for any $p$ and in particular a pivot. Hence
Theorem V.3 can be applied to obtain a bound on the
deviation from the median $M = M_p$ of the function $\rho(\cdot, p)$:

$$\forall r > 0, \quad \mu\{\omega \mid |\rho(\omega, p) - M| > r\} \le 2\alpha(r).$$

We combine these statements for all pivots $p_i$:

$$\forall r > 0, \quad \mu\{\omega \mid \sup_i |\rho(\omega, p_i) - M_i| > r/2\} \le 2k\alpha(r/2),$$

as the probability of the union can always be upper-bounded
by the sum of the probabilities. We note that no assumptions
about independence are used: the sequence $(p_i)$ can be
chosen in any way. Next, for all query centres $q$ in a
set of measure $> 1 - \alpha(r/2)$:

$$\forall r > 0, \quad \mu\{\omega \mid \sup_i |\rho(\omega, p_i) - \rho(q, p_i)| > r\} \le 2k\alpha(r/2).$$

We could introduce the set

$$\{\omega \in \Omega \mid \rho_k(q, \omega) > r\},$$

again denoted $C_q$, and think of the earlier $C_q \subset X$ as its
observation under the empirical measure $\mu_\#$. To recap:
for a randomly chosen query centre and each query radius
$r > 0$, with probability $> 1 - \alpha(r/2)$,

$$\mu(C_q) \le 2k\alpha(r/2). \tag{3}$$
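
The effect described by (3) can be observed numerically. The sketch below builds the pivot table of Sect. IV on a random sample of the Hamming cube and measures the fraction of the dataset that the pivot filter discards for one random query. All concrete choices (uniformly random pivots, n = 400, k = 8, radius r = 0.3, the function names) are our own assumptions for illustration only.

```python
import random


def normalized_hamming(x, y):
    """Fraction of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def pivot_distance(q_row, x_row):
    """Lower bound max_i |rho(q, p_i) - rho(x, p_i)| on rho(q, x),
    computed from precomputed distances to the pivots."""
    return max(abs(a - b) for a, b in zip(q_row, x_row))


def discarded_fraction(d, n=400, k=8, r=0.3, seed=0):
    """Fraction of a random dataset in the d-cube that a pivot index
    with k random pivots discards for one random range query."""
    rng = random.Random(seed)

    def point():
        return tuple(rng.randint(0, 1) for _ in range(d))

    data = [point() for _ in range(n)]
    pivots = [point() for _ in range(k)]
    # The index: an n x k table of distances to the pivots.
    table = [[normalized_hamming(x, p) for p in pivots] for x in data]
    q = point()
    q_row = [normalized_hamming(q, p) for p in pivots]
    discarded = sum(pivot_distance(q_row, row) > r for row in table)
    return discarded / n
```

In low dimension (say d = 8) the filter removes a sizeable part of the dataset, while for d = 512 all distances to the pivots cluster around 1/2 and essentially nothing is discarded, so almost every element still needs an explicit distance computation. A larger radius only makes the discarded set smaller.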

Remark V.4. We point out that Theorem V.3 applied to the
distance function $\rho$ gives a bound on the variance of $\rho(\cdot, p)$.
This, together with a "uniformity of view" type assumption
as in [Cha01b], leads us to conclude that the variance of
$\rho(\cdot, \cdot)$ converges to zero in Lévy families. This argument
can be formalized to demonstrate the connection to the
assumption of vanishing variance on the sequence of data
spaces made in [Sha06]. In our view that assumption is just
a variation on concentration of measure. The differences lie
in certain technical details, like the division by expectation
of $\rho(\cdot, \cdot)$ in [Sha06]. Here we simply avoid the issue by
normalizing spaces so that the expectation of $\rho(\cdot, \cdot)$ tends to
a constant. This normalization also fixes the problem of distance
to nearest neighbour (e.g. [Web98]) as we demonstrate
in the next section.

VI. Radius of queries in Lévy families

In our asymptotic analysis, we would like to normalize
spaces so that the median distance between two points stays
about the same. Here we will extract consequences for the
typical radius of a query, which we will assume to be the
distance to the nearest neighbour of the query centre.

Lemma VI.1 (M. Gromov, V.D. Milman [Gro83]). Let
$(\Omega, \rho, \mu)$ denote a metric space with measure and $\alpha$ its
concentration function. Then if $A \subset \Omega$ is such that $\mu(A) > \alpha(\gamma)$
for some $\gamma > 0$, it implies that $\mu(A_\gamma) > 1/2$.

Theorem VI.2. Let $(\Omega_d, \rho_d, \mu_d, X_d)_{d=1}^{\infty}$ be a sequence of
metric spaces with measure, forming a Lévy family, together
with i.i.d. samples Xd- Assume that n — nd — \Xd\ — 
d°^^\ Furthermore, if Md denotes the median value of 
$\{\rho_d(\omega_1, \omega_2) \mid \omega_i \in \Omega_d\}$, we assume that $M_d = \Theta(1)$, that
is, for some fixed $c_1, c_2 > 0$, $\forall d$, $c_1 < M_d < c_2$.

Let $\rho_d^{NN}(\omega)$ denote the distance to the nearest neighbour
of $\omega \in \Omega_d$ in $X_d$. Define $m_d$ to be the median of
$\rho_d^{NN}(\omega)$. Then there exists some $c_3 > 0$ and some $D$ such
that $\forall d \ge D$, $m_d > c_3$.
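Proof: Assume the conclusion fails; then, without loss
of generality and proceeding to a subsequence if necessary,
$m_d \to 0$. By definition of $m_d$, we know that for any $d$,

$$\mu_d\Big(\bigcup_{x \in X_d} B_{m_d}(x)\Big) \ge \frac{1}{2}.$$

It follows that

$$n_d \sup_{x \in X_d} \mu_d\big(B_{m_d}(x)\big) \ge \frac{1}{2},$$

and so we can find for any $d$ a point $\omega_d \in \Omega_d$ such that

$$\mu_d\big(B_{m_d}(\omega_d)\big) \ge \frac{1}{2n_d}.$$

If we denote by $\alpha_d$ the concentration functions of our spaces
$\Omega_d$, we know by assumption the existence of $C, c > 0$ s.t.

$$\forall d, \quad \alpha_d(\epsilon) \le C e^{-c\epsilon^2 d}.$$

Hence we can find $d'$ s.t. $\alpha_{d'}(\gamma) < 1/(2n_{d'})$ and $m_{d'} < c_1/8$,
where $\gamma = c_1/8$ as well. By Lemma VI.1,

$$\mu_{d'}\big((B_{m_{d'}}(\omega_{d'}))_\gamma\big) \ge \frac{1}{2}.$$

It then follows that

$$\mu_{d'}\big((B_{m_{d'}}(\omega_{d'}))_{2\gamma}\big) \ge 1 - C e^{-c\gamma^2 d'},$$

that is, since $m_{d'} + 2\gamma < 3c_1/8$,

$$\mu_{d'}\big(B_{3c_1/8}(\omega_{d'})\big) \ge 1 - C e^{-c\gamma^2 d'}.$$

But $\operatorname{diam}(B_{3c_1/8}(\omega_{d'})) \le 3c_1/4$, so in $\Omega_{d'} \times \Omega_{d'}$
the measure of the set of pairs $(\omega_1, \omega_2)$ for which
$\rho_{d'}(\omega_1, \omega_2) < c_1$ is at least

$$\Big(1 - \frac{1}{2n_{d'}}\Big)^2,$$

obviously contradicting $M_{d'} > c_1$. ■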
This result frees us from having to consider a radius that
vanishes as $n, d$ go to infinity. With this achieved, let us
recap our goal: to show that a large proportion of queries
are slow, something along the lines of:

$$\operatorname{median}_{q,\,p_i,\,r}\ \mu_\#\big(C_{q,p_1 \ldots p_{k(n)},r(n)}\big) \to 0 \quad \text{as } n, d \to \infty, \tag{4}$$

where the median is taken over all the queries under consideration:
any $q \in \Omega_d$ and any $r$ at least as large as the distance
to the nearest neighbour of $q$ in $X$. As well, for each $d$ and
$n = n_d$ we would like to also consider all possible pivot-based
index schemes (as long as $k$ is within certain ranges
we will specify later). So far we have shown, although the
proof was just sketched (and with the detail about $k$ left out),
that

$$\operatorname{median}_{q,\,p_i,\,r}\ \mu\big(C_{q,p_1 \ldots p_{k(n)},r(n)}\big) \to 0 \quad \text{as } n, d \to \infty. \tag{5}$$

What we need is to find out when (5) implies (4). To do
so we will summon the powerful machinery of statistical
learning theory.

VII. Statistical learning theory 

Statistical learning theory has already been used in the
analysis and design of indexing algorithms [Kle97] and is a
vast subject. We will just focus on the generalization of the
Glivenko-Cantelli theorem due to Vapnik and Chervonenkis.

Theorem VII.1 (Glivenko-Cantelli). Given a sample $X =
X_1, X_2, \ldots, X_n$ distributed i.i.d. according to any measure $\mu$
on $\mathbb{R}$, we have:

$$\sup_{r \in \mathbb{R}} |\mu_n(-\infty, r] - \mu(-\infty, r]| \to 0.$$

We can see this statement in terms of the empirical measures
of particular subsets converging to their true measure.
This is made clear when we restate the theorem as follows:

$$\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| \to 0, \tag{6}$$

where

$$\mathcal{A} = \{(-\infty, r] \mid r \in \mathbb{R}\},$$

which makes more apparent a path for extension: to generalize
to other collections of subsets $\mathcal{A}$.

A collection $\mathcal{A}$ "colours" the sample $X$ as follows. Each
$A \in \mathcal{A}$ will assign 1 to $X_i$ if $X_i \in A$, and 0 otherwise. We
denote by $N(X)$ the number of such different colourings
of $X$ generated by all $A \in \mathcal{A}$. Clearly $N(X) \le 2^n$. What
is surprising is that in many situations, despite a seemingly
rich $\mathcal{A}$, we have $N(X) < 2^n$.
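
For the half-lines of the Glivenko-Cantelli theorem the colourings can be enumerated directly. The sketch below is our own illustration (the function name is hypothetical): it lists the distinct 0/1 labelings of a sample induced by the sets (-inf, r].

```python
def colourings_by_half_lines(sample):
    """Set of distinct 0/1 labelings of `sample` induced by the
    half-lines (-inf, r], as r ranges over the real line."""
    pts = sorted(set(sample))
    # One threshold below every point, then one at each point: the
    # labeling only changes when r crosses a sample value.
    thresholds = [pts[0] - 1.0] + pts
    return {tuple(1 if x <= r else 0 for x in sample) for r in thresholds}
```

A sample of n distinct reals receives only n + 1 colourings out of the 2^n conceivable ones, even though the collection of half-lines is uncountable.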

Definition VII.2. The growth function $G = G_\mathcal{A}$ of a family
$\mathcal{A}$ is defined by

$$G(n) = \ln \sup_{|X| = n} N(X).$$

It is independent of $\mu$ and the choice of sample $X$.

There are two cases to consider for an upper bound for
the growth function [Vap98]:

• for all $n$, $G(n) = n \ln 2$;

• or, for the largest $\Delta$ such that $G(\Delta) = \Delta \ln 2$,

$$G(n) = n \ln 2 \ \text{ if } n \le \Delta, \qquad G(n) \le \Delta(1 + \ln(n/\Delta)) \ \text{ if } n > \Delta.$$

This $\Delta$ is the so-called VC dimension and it turns out that
its finiteness is a necessary and sufficient condition for (6).
The rate of convergence is as follows ([Vap98], p. 148):



Theorem VII.3 [Vapnik-Chervonenkis]. For a collection $\mathcal{A}$
of subsets of $\Omega$, of finite VC dimension $\Delta$, and any measure
$\mu$ on $\Omega$, we have that for any $\epsilon > 0$,

$$P\Big\{\sup_{A \in \mathcal{A}} |\mu_n(A) - \mu(A)| > \epsilon\Big\} < 4\exp\Big\{\Big(\frac{\Delta(1 + \ln(2n/\Delta))}{n} - \Big(\epsilon - \frac{1}{n}\Big)^2\Big)\, n\Big\}.$$

The convergence is eventually like $\exp(-\epsilon^2 n)$, which is
again a fast rate of convergence. Since no information about
the measure $\mu$ is incorporated, the left side can be replaced
by its supremum taken over all possible probability measures
on the domain $\Omega$.

A natural restatement of these results is to ask how big
the sample size $n$ has to be for the expression on the
left to be less than some $\eta > 0$. Solving for $\eta$ and the use
of some technical inequalities (cf. e.g. [Men03]) yields:

$$n > \frac{128}{\epsilon^2}\Big(\Delta \log\frac{2e^2}{\epsilon} + \log\frac{8}{\eta}\Big). \tag{7}$$
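
To get a feeling for the magnitudes in (7), the bound can be evaluated directly. The sketch below reads (7) as n > (128/eps^2)(Delta log(2e^2/eps) + log(8/eta)) and assumes natural logarithms; the function name and this reading of the formula are our own.

```python
import math


def sample_size_bound(vc_dim: float, eps: float, eta: float) -> float:
    """Right-hand side of (7): a sample size above which empirical and
    true measures differ by more than eps with probability below eta."""
    if not (eps > 0 and 0 < eta < 1):
        raise ValueError("need eps > 0 and 0 < eta < 1")
    return (128.0 / eps ** 2) * (
        vc_dim * math.log(2.0 * math.e ** 2 / eps) + math.log(8.0 / eta)
    )
```

For example, `sample_size_bound(10, 0.5, 0.1)` is a little under twenty thousand, and the value grows linearly in the VC dimension, which is the feature exploited in the main result.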



Calculations of VC dimension have been done for various
objects (e.g. [Dud84], [Vap98], [Dev97]): The VC dimension
of half-spaces $\{x \in \mathbb{R}^d \mid \langle x, v\rangle \ge b\}$ in $\mathbb{R}^d$ is $d + 1$. The VC
dimension of all open (or closed) balls in $\mathbb{R}^d$ is also $d + 1$.
Axis-aligned rectangular parallelepipeds in $\mathbb{R}^d$, i.e. sets of the
form

$$[a_1, b_1] \times [a_2, b_2] \times \ldots \times [a_d, b_d],$$

have a VC dimension of $2d$.

Our interest is in calculating the VC dimension of all
possible sets of the form $C_q$, the collection of which for a fixed
$k$ we denote:

$$\mathcal{A} = \mathcal{A}_k = \{C_{q,p_1 \ldots p_{k(n)},r(n)} \mid q \in \Omega,\ p_i \in \Omega,\ r > 0\}. \tag{8}$$

As

$$C = \{\omega : \sup_i \big|\, \|\omega - p_i\| - \|q - p_i\| \,\big| > r\},$$

we can proceed through several steps. A set of the form

$$\{\omega : \big|\, \|\omega - p_i\| - \|q - p_i\| \,\big| < r\}$$

is a "spherical shell," and an intersection of shells is an
intersection of sets from $\mathcal{A} \cup \mathcal{A}^c$, where $\mathcal{A}$ is the collection
of all balls. It is easy to show that given a collection $\mathcal{A}$ the
complement collection $\mathcal{A}^c = \{A^c \mid A \in \mathcal{A}\}$ has the same VC
dimension. The VC dimension of balls was quoted above as
$d + 1$, hence the VC dimension of complements of balls is
$d + 1$ as well. The VC dimension of the union of the two
collections is

$$(d + 1) + (d + 1) + 1 = 2d + 3,$$

as a consequence of a general result [Vid03]: If a collection
$\mathcal{A}$ has VC dimension $\Delta_a$ and a collection $\mathcal{B}$ has VC
dimension $\Delta_b$, the union $\mathcal{A} \cup \mathcal{B}$ has VC dimension at most

$$\Delta_a + \Delta_b + 1.$$

A result for intersection of sets is mentioned in [Blu89]:

Lemma VII.4. For $(\Omega, \rho) = (\mathbb{R}^d, L^2)$, an upper bound on
the VC dimension of the collection composed of $k$-fold intersections
of elements of a family $\mathcal{A}$ of VC dimension $\Delta$ is $2\Delta k \ln(3k)$.

Hence we can conclude that the VC dimension of $\mathcal{A}_k$ for
the case $\Omega \subset \mathbb{R}^d$ is bounded by

$$2(2d + 3)(2k) \ln(3 \cdot 2k) = k(8d + 12) \ln(6k), \tag{9}$$

where $k$ is the number of pivots.

Another example comes from considering the Hamming
cube. As there are $2^d$ points in a $d$-dimensional Hamming
cube, and at most $d$ different radii, at most $d2^d$ different
balls exist. We know from e.g. [Blu89] that if the class
$\mathcal{A}$ is finite, its VC dimension is bounded by $\log_2 |\mathcal{A}|$.
Disregarding the small leftover term, the VC dimension for
balls in the Hamming cube is about $d$.

Summarizing:

Theorem VII.5. Let us denote by $\Delta$ the VC dimension of
the collection $\mathcal{A}_k$ as defined in equation (8). Then upper bounds
on $\Delta$, depending on the metric space, are as follows:

• For $(\mathbb{R}^d, L^2)$, $\Delta \le k(8d + 12) \ln(6k)$.

• For $(\mathbb{R}^d, L^\infty)$, $\Delta \le k(16d + 4) \ln(6k)$.

• For $(\Sigma^d, \rho)$, $\Delta \le k(8d + 8\log_2 d + 4) \ln(6k)$.
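
The chain of bounds leading to the Euclidean case can be traced step by step. The following sketch mirrors the argument under the stated results: balls in R^d have VC dimension d + 1, a union of two collections is bounded by the sum plus one, and k-fold intersections are bounded by 2 Delta k ln(3k). The function names are our own.

```python
import math


def vc_union_bound(vc_a: float, vc_b: float) -> float:
    """Bound on the VC dimension of the union of two collections."""
    return vc_a + vc_b + 1


def vc_intersection_bound(vc: float, k: int) -> float:
    """Bound 2 * vc * k * ln(3k) for k-fold intersections."""
    return 2 * vc * k * math.log(3 * k)


def vc_pivot_sets_euclidean(d: int, k: int) -> float:
    """Bound on the VC dimension of the sets C_q for k pivots in R^d."""
    balls = d + 1                                  # balls in R^d
    balls_and_complements = vc_union_bound(balls, balls)  # 2d + 3
    # Each of the k shells is an intersection of two such sets,
    # so 2k sets are intersected in total.
    return vc_intersection_bound(balls_and_complements, 2 * k)
```

The result agrees with the closed form k(8d + 12) ln(6k); for d = 10 and k = 5 this is 460 ln 30, roughly 1565.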

VIII. Main result

Theorem VIII.1. Consider a sequence of metric spaces
$(\Omega_d, \rho_d)$, where $d = 1, 2, 3, \ldots$ and the VC dimension of
closed balls in $(\Omega_d, \rho_d)$ is $O(d)$. Assume every $\Omega_d$ supports
a Borel probability measure $\mu_d$ so that for some $C, c > 0$
the concentration functions $\alpha_d$ of $(\Omega_d, \rho_d, \mu_d)$ satisfy

$$\forall \epsilon > 0, \quad \alpha_d(\epsilon) \le C e^{-c\epsilon^2 d}.$$

Select for each $d$ an i.i.d. sample $X_d$ of size $n_d$ from $\Omega_d$,
according to $\mu_d$, where the sample size $n_d$ satisfies
$d = \omega(\log n_d)$ and $d = n_d^{o(1)}$. Suppose further for every $d$ a pivot
index for similarity search is built using $k$ pivots, where
$k = o(n_d/d)$.

Fix arbitrarily small $\epsilon, \eta > 0$. Suppose we only ask queries
whose radius is equal to or greater than the distance to the nearest
neighbour of the query centre $q \in \Omega_d$ in $X_d$.
Then there exists a $D$ such that for all $d \ge D$, the
probability that at least half the queries on dataset $X_d$
take less than $(1 - \epsilon)n_d$ time is less than $\eta$.

Furthermore, if we allow the likelihood $\eta$ to depend on
$d$, we can pick $\eta_d$ so that the above holds true and

$$\lim_{D \to \infty} \prod_{d=D}^{\infty} (1 - \eta_d) = 1.$$

We emphasize that this result is independent of the
selection of pivots.

Sketch of a proof: From Eq. (3) we know that, for a
vast majority of query centres $q$,



where M2 is some constant. 

We will sacrifice a certain number of sets of form $C_q$ so
that $r$ can be considered a constant (see Section VI): we will
proceed with at least half the queries having radius $r$ above
a constant independent of $d$. Hence the quantities that vary
in $d$ are $n$ and $k$. Since $d$ is superlogarithmic in $n$,

$$\forall c > 0,\ d > c \log n \;\Rightarrow\; \forall c > 0,\ \exp(-d) < \exp(-c \log n) \;\Rightarrow\; \forall c > 0,\ \exp(-d) < n^{-c}.$$

So the exponential term is $o(n)$, and hence $\mu(C_q) = o(n)$. In fact this
holds for at least half the queries $q$ simultaneously, so:

$$\operatorname{median} \sup_{C_q} \mu(C_q) = o(n).$$

From the previous section, we know that only for large
values of $n$ will empirical measures be close (up to $\epsilon$) to
actual measures with likelihood $(1 - \eta)$. The lower bound
on $n$ then naturally depends on $\epsilon$, $\eta$ but also on the VC
dimension $\Delta$ of the collection $\mathcal{A}_k$.

Let us fix $\epsilon = 1/2$ and assume $\eta$ is bounded by some
value less than 1. Then by pooling all constants, including
$\epsilon$ and $\eta$ but not $\Delta$, we can rewrite expression (7) as:

$$n \ge M_1 \Delta, \tag{10}$$

where $M_1 > 1$. What we would like to avoid is to have the
right part of this expression grow linearly in $n$. We know
an upper bound on $\Delta$ depends on $k$ and $d$ as established in
Theorem VII.5. As our concern is for asymptotic behaviour
we will simplify this bound to $kd \ln k$.

Combining $d = o(n)$ with the asymptotic condition on $k$,
we conclude that:

$$\Delta = o(n),$$

and hence asymptotically we know that the right side of
expression (10) falls (much) under $n$. Therefore we are able
to conclude:

$$P\Big(\sup_{C_q} |\mu_\#(C_q) - \mu(C_q)| > \epsilon\Big) < \eta, \tag{11}$$

which, combined with $\epsilon = 1/2$ and $\operatorname{median} \sup_{C_q} \mu(C_q) =
o(n)$, gives the first part of the result.

According to expression (7),

$$\eta > \exp\Big(\Delta \log\Big(\frac{2e^2}{\epsilon}\Big) + \log 8 - \frac{\epsilon^2 n}{128}\Big).$$



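Assuming independent choices of the datasets $X_d$, and
assuming that for each $d$ the probability of an event is at
least $1 - \eta_d$, we aim to prove that

$$\lim_{D \to \infty} \prod_{d=D}^{\infty} (1 - \eta_d) = 1.$$

As $\eta_d$ goes to 0 at least as fast as $e^{-d}$, it is enough to show
that

$$\lim_{D \to \infty} \prod_{d=D}^{\infty} (1 - e^{-d}) = 1. \tag{12}$$

Observing [Ash71] that for any sequence $0 \le \eta_d \le 1$,

$$1 - \sum_{d=1}^{N} \eta_d \le \prod_{d=1}^{N} (1 - \eta_d) \le \exp\Big(-\sum_{d=1}^{N} \eta_d\Big),$$

we can extend this, for any $D$, to:

$$1 - \sum_{d=D}^{\infty} \eta_d \le \prod_{d=D}^{\infty} (1 - \eta_d) \le \exp\Big(-\sum_{d=D}^{\infty} \eta_d\Big).$$

Summing the geometric series, we obtain Eq. (12). ■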
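
The last step can be checked numerically. The sketch below evaluates, for the choice of failure probabilities e^{-d}, the product of (1 - e^{-d}) over d from D up to a cutoff N together with the two bounds used above. The cutoff N = 200 stands in for infinity and, like the function name, is our own assumption; the neglected tail is far below floating-point precision.

```python
import math


def product_bounds(D: int, N: int = 200):
    """Return (lower, product, upper) for eta_d = exp(-d), d = D..N:
    lower = 1 - sum(eta_d), product = prod(1 - eta_d),
    upper = exp(-sum(eta_d))."""
    etas = [math.exp(-d) for d in range(D, N + 1)]
    total = sum(etas)
    prod = 1.0
    for e in etas:
        prod *= 1.0 - e
    return 1.0 - total, prod, math.exp(-total)
```

For D = 5 the three values are about 0.9893, 0.9894 and 0.9894, and as D grows both bounds, and hence the product, approach 1.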

A. Conclusion 

We have established a rigorous asymptotically linear
lower bound on the expected average performance of the optimal
pivot-based indexing schemes for similarity search in
datasets randomly sampled from domains whose dimension
goes to infinity. The examples given above of the various
spaces exhibiting normal concentration of measure should
convince the reader that many of the most naturally occurring
domains and measure distributions are such.

This is not the first lower bound result for pivoting
algorithms for exact similarity search. A specific lower
bound for pivot-based indexing already mentioned above is
that of [Cha01b]:

$$d \ln n.$$

This result assumes that $k = \Theta(\log n)$. Furthermore and
more importantly, the pivot selection is assumed to be
random, as opposed to our (much stronger) bound that is
applicable to any pivot selection technique.

Other, more general asymptotic analyses considering more
classes of indexing schemes [Web98], [Sha06] fix $n$ or, in the
case of [Web98], also fail to distinguish between the dataset
and the data space, making results appear stronger than they
actually are.

The aim in [Sha06] was to demonstrate that

$$\frac{E(\text{cost})}{n} \to 1,$$

which came at the expense of any results on the rate of
convergence. We chose instead to prove a weaker result, with
convergence to some number close to 1/2 but with estimates
on the rate of convergence.



It should not be assumed that the hypotheses of our paper are
universal. Rather, our theoretical analysis confirms that at
least in some settings, the curse of dimensionality for
pivot-based schemes is indeed in the nature of data. Probably a
more realistic situation from the viewpoint of applications
would be that of an intrinsically low dimensional dataset
contained in a high-dimensional domain, and performing
an asymptotic analysis of various indexing schemes in this
setting is an interesting open problem.